[Perf] Early return in KVCacheManager.allocate_slots #29206
Conversation
Signed-off-by: Jialin Ouyang <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
):
    # Early return as no new blocks needed to be allocated
    return self.empty_kv_cache_blocks
```
Skipped sliding-window cleanup when no new blocks allocated
allocate_slots now returns early when num_blocks_to_allocate==0 and no prefix-cache hits, so coordinator.remove_skipped_blocks() is never called on those steps. For sliding-window/ChunkedLocal attention this call is what frees blocks that have fallen outside the window; skipping it leaves those blocks held until the request hits a block boundary or finishes, inflating block_pool usage and causing unnecessary evictions or scheduling failures for long prompts that keep generating within an existing block.
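To make the hazard concrete, here is a toy, self-contained simulation (all names are invented; this is not vLLM code) of one request decoding a token per step under sliding-window attention. When cleanup only runs on steps that allocate a new block (the early-return behavior), out-of-window blocks stay held for up to `block_size - 1` extra steps, so the cumulative block-pool usage is higher:

```python
def simulate(num_steps, block_size=4, window=4, early_return=True):
    """Toy model: decode one token per step; return total block-steps held.

    A block can be freed once every token it holds has fallen outside the
    sliding window. With early_return=True, that cleanup only happens on
    steps that also allocate a new block.
    """
    held = set()          # indices of blocks currently held by the request
    allocated = 0         # number of blocks ever allocated
    held_block_steps = 0  # sum over steps of blocks held (pool-usage proxy)
    for n in range(1, num_steps + 1):      # n = tokens generated so far
        needed = -(-n // block_size)       # ceil(n / block_size)
        if needed <= allocated and early_return:
            held_block_steps += len(held)
            continue                       # early return: cleanup is skipped
        # cleanup (remove_skipped_blocks analogue): free blocks fully
        # outside the sliding window
        window_start = max(1, n - window + 1)
        lowest_needed = (window_start - 1) // block_size
        held = {b for b in held if b >= lowest_needed}
        # allocate any newly needed blocks
        held.update(range(allocated, needed))
        allocated = max(allocated, needed)
        held_block_steps += len(held)
    return held_block_steps

print(simulate(12, early_return=True), simulate(12, early_return=False))
```

Over 12 decode steps with a 4-token block and a 4-token window, the early-return variant accumulates more held block-steps than running cleanup every step, illustrating the inflated block_pool usage described above.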
```python
if (
    num_blocks_to_allocate == 0
    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
```

Suggested change:

```diff
-    and new_computed_block_list is self.empty_kv_cache_blocks.blocks
+    and new_computed_blocks is None
```
I prefer this slightly.
And do we need to add num_computed_tokens > request.num_prompt_tokens so that remove_skipped_blocks is still called on the first decode step? That would help free the blocks for prefill tokens that were used by the last prefill step but fall outside the sliding window of the first decode step. I would be grateful if you could try gpt-oss and gemma3, two models with small sliding window sizes.
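A sketch of the combined guard the review is converging on (the names follow the diff and the comment above; the exact parameters of allocate_slots are not verified against vLLM's current signature, so treat this as illustrative only). The extra token comparison keeps the early return from firing on the first decode step, where num_computed_tokens still equals the prompt length, so cleanup runs at least once after prefill:

```python
def can_early_return(num_blocks_to_allocate, new_computed_blocks,
                     num_computed_tokens, num_prompt_tokens):
    """Hypothetical guard: early-return only when no new blocks are needed,
    there are no prefix-cache hits, and the request is past its first
    decode step (so post-prefill sliding-window cleanup has already run)."""
    return (
        num_blocks_to_allocate == 0
        and new_computed_blocks is None
        and num_computed_tokens > num_prompt_tokens
    )
```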
Also, can you add a comment noting that this may delay remove_skipped_blocks and cache_blocks, and give some analysis of why that is fine?
Hmmm, great call. I didn't notice the cache_blocks logic. Let me think about the early return logic more :/
Signed-off-by: Jialin Ouyang <[email protected]>
Purpose
As the title says, add an early return to KVCacheManager.allocate_slots to speed up scheduling, since most requests only allocate a new block once every 16 steps.
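A quick sanity check of the "one block every 16 steps" claim (16 tokens per block is vLLM's default block size): during decode, one token is appended per step, so a new KV-cache block is needed only when the token count crosses a block boundary:

```python
BLOCK_SIZE = 16  # vLLM's default tokens-per-block

def needs_new_block(num_tokens):
    """True iff appending the num_tokens-th token starts a new block."""
    return (num_tokens - 1) % BLOCK_SIZE == 0

# Over 160 decode steps, only 1 step in 16 allocates a block;
# the other 15 can take the early-return fast path.
allocating = sum(needs_new_block(n) for n in range(1, 161))
print(allocating)  # 10 of 160 steps
```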
Test Plan & Test Result
With the change, the allocate_slots cost is roughly halved, from 17.24% to 8.24% of total runtime (with async scheduling enabled).

Total runtime distribution for allocate_slots per engine step

Public benchmarks